Principal Component Analysis for Large Scale Problems with Lots of Missing Values

Authors

Tapani Raiko

Alexander Ilin

Juha Karhunen

Abstract

Principal component analysis (PCA) is a well-known classical data analysis technique. There are a number of algorithms for solving the problem, some of which scale better than others to problems with high dimensionality. They also differ in their ability to handle missing values in the data. We study a case where the data are high-dimensional and a majority of the values are missing. In the case of very sparse data, overfitting becomes a severe problem even in simple linear models such as PCA. We propose an algorithm based on speeding up a simple principal subspace rule, and extend it to use regularization and variational Bayesian (VB) learning. The experiments with Netflix data confirm that the proposed algorithm is much faster than any of the compared methods, and that the VB-PCA method provides more accurate predictions for new data than traditional PCA or regularized PCA.
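
To make the setting concrete, the following is a minimal sketch of fitting a low-rank PCA-style model to a matrix in which most entries are missing. It is not the authors' accelerated subspace rule or their VB-PCA method. It is plain stochastic gradient descent on the squared reconstruction error over the observed entries only, with an L2 penalty standing in for regularization. The function names (`fit_sparse_pca`, `predict`, `rmse`), the dictionary-based data layout, and all hyperparameter values are assumptions made for illustration.

```python
import math
import random


def fit_sparse_pca(observed, n_rows, n_cols, n_components=1,
                   learning_rate=0.02, l2=0.01, n_epochs=1000, seed=0):
    """Fit X[i][j] ~ sum_k A[i][k] * S[j][k] using only observed entries.

    observed: dict mapping (row, col) -> value; missing entries are absent.
    Returns the factor matrices A (n_rows x k) and S (n_cols x k).
    """
    rng = random.Random(seed)
    # Small random initialization (an illustrative choice).
    A = [[rng.gauss(0.0, 0.1) for _ in range(n_components)]
         for _ in range(n_rows)]
    S = [[rng.gauss(0.0, 0.1) for _ in range(n_components)]
         for _ in range(n_cols)]
    for _ in range(n_epochs):
        for (i, j), x in observed.items():
            a, s = A[i], S[j]
            err = x - sum(ak * sk for ak, sk in zip(a, s))
            for k in range(n_components):
                ak, sk = a[k], s[k]
                # Gradient step on squared error plus L2 penalty.
                a[k] += learning_rate * (err * sk - l2 * ak)
                s[k] += learning_rate * (err * ak - l2 * sk)
    return A, S


def predict(A, S, i, j):
    """Reconstruct entry (i, j), whether or not it was observed."""
    return sum(ak * sk for ak, sk in zip(A[i], S[j]))


def rmse(A, S, entries):
    """Root-mean-square error over a dict of (row, col) -> value."""
    total = sum((x - predict(A, S, i, j)) ** 2
                for (i, j), x in entries.items())
    return math.sqrt(total / len(entries))


if __name__ == "__main__":
    # Toy rank-1 matrix with two entries held out as "missing".
    u = [1.0, 2.0, 3.0]
    v = [1.0, 0.5, 2.0, 1.5]
    full = {(i, j): u[i] * v[j] for i in range(3) for j in range(4)}
    held_out = {(0, 3), (2, 1)}
    train = {k: x for k, x in full.items() if k not in held_out}
    test = {k: x for k, x in full.items() if k in held_out}

    A, S = fit_sparse_pca(train, n_rows=3, n_cols=4)
    print("train RMSE:", rmse(A, S, train))
    print("held-out RMSE:", rmse(A, S, test))
```

Comparing the training error with the held-out error is the simplest way to see the overfitting issue the abstract mentions: with very sparse data, a larger `n_components` or a smaller `l2` can lower the former while raising the latter.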

Related resources

Extensions of probabilistic PCA

Principal component analysis (PCA) is a classical data analysis technique. Some algorithms for PCA scale better than others to problems with high dimensionality. They also differ in the ability to handle missing values in the data. In our recent paper [1], a case is studied where the data are high-dimensional and a majority of the values are missing. In the case of very sparse data, overfitting...

Video Subject Inpainting: A Posture-Based Method

Despite recent advances in video inpainting techniques, reconstructing large missing regions of a moving subject while its scale changes remains an elusive goal. In this paper, we have introduced a scale-change invariant method for large missing regions to tackle this problem. Using this framework, first the moving foreground is separated from the background and its scale is equalized. Then, a ...

Missing Value Estimation of Epistatic Miniarray Profiling Data by Kernel PCA Regression Ensemble Approach

Missing data imputation is a key issue in learning from incomplete data. Various techniques have been developed with great success on dealing with missing values in data sets with heterogeneous attributes (their independent attributes are of different types) referred to as imputing mixed-attribute data sets. Epistatic miniarray profiling (E-MAP) is a powerful tool for analyzing gene functions a...

Comparison of Variational Bayes and Gibbs Sampling in Reconstruction of Missing Values with Probabilistic Principal Component Analysis

Lately there has been the interest of categorization and pattern detection in large data sets, including the recovering of the dataset missing values. In this project the objective will be to recover the subset of missing values as accurately as possible from a movie rating data set. Initially the data matrix is preprocessed and its elements are divided in training and test sets. Thereafter the...

DISCRETE AND CONTINUOUS SIZING OPTIMIZATION OF LARGE-SCALE TRUSS STRUCTURES USING DE-MEDT ALGORITHM

Design optimization of structures with discrete and continuous search spaces is a complex optimization problem with lots of local optima. Metaheuristic optimization algorithms, due to not requiring gradient information of the objective function, are efficient tools for solving these problems at a reasonable computational time. In this paper, the Doppler Effect-Mean Euclidian Distance Threshold ...

Publication date: 2007
